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Abstract 

An  effective  approach  of  simulating  fluid  dynamics  on  a  cluster  of  non-dedicated  workstations  is  presented. 
The  approach  uses  local  interaction  algorithms,  small  communication  capacity,  and  automatic  migration 
of  parallel  processes  from  busy  hosts  to  free  hosts.  The  approach  is  well-suited  for  simulating  subsonic 
flow  problems  which  involve  both  hydrodynamics  and  acoustic  waves;  for  example,  the  flow  of  air  inside 
wind  musical  instruments.  Typical  simulations  achieve  80%  parallel  efficiency  (speedup/processors)  using 
20  HP-Apollo  workstations.  Detailed  measurements  of  the  parallel  efficiency  of  2D  and  3D  simulations  are 
presented,  and  a  theoretical  model  of  efficiency  is  developed  which  fits  closely  the  measurements.  Two  nu¬ 
merical  methods  of  fluid  dynamics  are  tested:  explicit  finite  differences,  and  the  lattice  Boltzmann  method. 
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Figure  1:  Simulation  of  a  flue  pipe  using  20  workstations 
in  a  (5  X  4)  decomposition. 

1  Introduction 

An  effective  approach  of  simulating  fluid  dynamics  on  a 
cluster  of  non-dedicated  workstations  is  presented.  Con¬ 
currency  is  achieved  by  decomposing  the  simulated  area 
into  rectangular  subregions,  and  by  assigning  the  sub- 
regions  to  parallel  subprocesses.  The  use  of  local  in¬ 
teraction  methods,  namely  explicit  numerical  methods, 
leads  to  small  communication  requirements.  The  paral¬ 
lel  subprocesses  automatically  migrate  from  busy  hosts 
to  free  hosts  in  order  to  exploit  the  unused  cycles  of 
non-dedicated  workstations,  and  to  avoid  disturbing  the 
regular  users  of  the  workstations.  The  system  is  straight¬ 
forwardly  implemented  on  top  of  UNIX  and  TCP/IP 
communication  routines. 

Typical  simulations  achieve  80%  parallel  efficiency 
(speedup/processors)  using  20  HP- Apollo  workstations 
in  a  cluster  where  there  are  25  non-dedicated  worksta¬ 
tions  total.  Detailed  measurements  of  efficiency  in  sim¬ 
ulating  two  and  three-dimensional  flows  are  presented, 
and  a  theoretical  model  of  efficiency  is  developed  which 
fits  closely  the  measurements.  Two  numerical  methods 
of  fluid  dynamics  are  tested:  finite  differences  and  the 
lattice  Boltzmann  method.  Further,  it  is  shown  that 
the  shared-bus  Ethernet  network  is  adequate  for  two- 
dimensional  simulations  of  fluid  dynamics,  but  limited 
for  three-dimensional  ones.  It  is  expected  that  new  tech¬ 
nologies  in  the  near  future  such  as  Ethernet  switches, 
FDDI  and  ATM  networks  will  make  practical  three- 
dimensional  simulations  of  fluid  dynamics  on  a  cluster 
of  workstations. 

The  present  approach  is  well-suited  for  simulating  sub-  ^ 


sonic  flow  problems  which  involve  both  hydrodynamics 
and  acoustic  waves;  for  example,  the  flow  of  air  inside 
wind  musical  instruments.  This  is  because  such  prob¬ 
lems  favor  the  use  of  explicit  numerical  methods  versus 
implicit  ones,  as  explained  below  and  in  section  6.  The 
use  of  explicit  methods  is  desirable  for  parallel  comput¬ 
ing  on  a  cluster  of  workstations  because  explicit  methods 
have  small  communication  requirements.  Thus,  there  is 
a  good  match  between  the  nature  of  the  problem,  the 
use  of  explicit  methods,  and  the  parallel  system. 

The  choice  between  explicit  and  implicit  numerical 
methods  is  a  recurring  theme  in  scientific  computing. 
Explicit  methods  are  local,  ideally  scalable,  and  require 
small  integration  time  steps  in  order  to  remain  numer¬ 
ically  stable.  By  contrast,  implicit  methods  are  chal¬ 
lenging  to  parallelize,  have  large  communication  require¬ 
ments,  but  they  can  use  much  larger  integration  time 
steps  than  explicit  methods.  Because  of  these  differ¬ 
ences  between  explicit  and  implicit  methods,  the  deci¬ 
sion  which  method  to  use  depends  on  the  available  com¬ 
puter  system,  and  on  the  requirements  of  the  problem  on 
the  integration  time  step.  For  instance,  the  simulation 
of  subsonic  flow  requires  small  integration  time  steps  in 
order  to  follow  the  feist-moving  acoustic  waves.  Thus, 
subsonic  flow  is  a  good  problem  for  explicit  methods. 

1.1  Comparison  with  other  work 

The  suitability  of  local  interaction  algorithms  for  parallel 
computing  on  a  cluster  of  workstations  has  been  demon¬ 
strated  in  previous  works,  such  as  [1],  [2],  and  elsewhere. 
Cap&Strumpen  [1]  present  the  PARFORM  system  and 
simulate  the  unsteady  heat  equation  using  explicit  fi¬ 
nite  differences.  Chase&et  al.  [2]  present  the  AMBER 
parallel  system,  and  solve  Laplace’s  equation  using  Suc¬ 
cessive  Over-Relaxation.  The  present  work  emphasizes 
and  clarifies  further  the  importance  of  local  interaction 
methods  for  parallel  systems  with  small  communication 
capacity.  Furthermore,  a  real  problem  of  science  and 
engineering  is  solved  using  the  present  approach.  The 
problem  is  the  simulation  of  subsonic  flow  with  acoustic 
waves  inside  wind  musical  instruments. 

In  the  fluid  dynamics  community,  very  little  atten¬ 
tion  has  been  given  to  simulations  of  subsonic  flow  with 
acoustic  waves.  The  reason  is  that  such  simulations 
are  very  compute-intensive,  and  can  be  performed  only 
when  parallel  systems  such  as  the  one  described  here 
are  available.  Further,  the  fluid  dynamics  community 


has  generally  shunned  the  use  of  explicit  methods  until 
recently  because  explicit  methods  require  small  integra¬ 
tion  time  steps  to  remain  numerically  stable.  With  the 
increasing  availability  of  parallel  systems,  explicit  meth¬ 
ods  are  slowly  attracting  more  attention.  The  present 
work  clearly  reveals  the  power  of  explicit  methods  in  one 
particular  example,  and  should  motivate  further  work  in 
this  direction. 

Regarding  the  experimental  measurements  of  parallel 
efficiency  which  are  presented  in  section  7,  they  are  more 
detailed  than  in  any  other  reference  known  to  the  author, 
especially  for  the  case  of  a  shared-bus  Ethernet  network. 
The  model  of  parallel  efficiency  which  is  discussed  in 
section  8  is  based  on  ideas  which  have  been  discussed 
previously,  for  example  in  Fox  et  al.  [3]  and  elsewhere. 
Here,  the  model  is  derived  in  a  clear  and  direct  way, 
and  moreover  the  predictions  of  the  model  are  compared 
against  experimental  measurements  of  parallel  efficiency. 

Regarding  the  problem  of  using  non-dedicated  work¬ 
stations,  the  present  approach  solves  the  problem  by 
employing  automatic  process  migration  from  busy  hosts 
to  free  hosts.  An  alternative  approach  that  has  been 
used  elsewhere  is  the  dynamic  allocation  of  processor 
workload.  In  the  present  context,  dynamic  allocation 
means  to  enlarge  and  to  shrink  the  subregions  which 
are  assigned  to  each  workstation  depending  on  the  CPU 
load  of  the  workstation  (CapfcStrumpen  [1]).  Although 
this  approach  is  important  in  various  applications  (Blu- 
mofe&Park  [4]),  it  seems  unnecessary  for  simulating  fluid 
flow  problems  with  static  geometry.  For  such  problems, 
it  may  be  simpler  and  more  effective  to  use  fixed  size  sub- 
regions  per  processor,  and  to  use  automatic  migration  of 
processes  from  busy  hosts  to  free  hosts.  The  latter  ap¬ 
proach  has  worked  very  well  in  the  parallel  simulations 
presented  here. 

Regarding  the  design  of  parallel  simulation  systems, 
the  present  work  aims  for  simplicity.  In  particular, 
the  special  constraints  of  local  interaction  problems  and 
static  decomposition  have  guided  the  design  of  the  par¬ 
allel  system.  The  automatic  migration  of  processes  hcis 
been  implemented  in  a  straightforward  manner  because 
the  system  is  very  simple.  The  availability  of  a  homoge¬ 
neous  cluster  of  workstations,  and  a  common  file  system 
have  also  simplified  the  implementation,  which  is  based 
on  UNIX  and  TCP/IP  communication  routines.  The  ap¬ 
proach  presented  here  works  well  for  spatially-organized 
computations  which  employ  a  static  decomposition  and 


local  interaction  algorithms. 

The  approach  presented  here  does  not  deal  with  is¬ 
sues  such  as  high-level  distributed  programming,  paral¬ 
lel  languages,  inhomogeneous  clusters,  and  distributed 
computing  of  general  problems.  Efforts  along  these  di¬ 
rections  are  the  PVM  system  (Sunderam  [5]),  the  Linda 
system  (Carrier©  [6]),  the  packages  of  (Kohn&Baden  [7]) 
and  (Chesshire&:Naik  [8])  that  facilitate  a  parallel  de¬ 
composition,  the  Orca  language  for  distributed  comput¬ 
ing  (Bal&et  al.  [9]),  etc. 

1,2  Outline 

Section  2  presents  some  examples  of  parallel  simulations 
which  demonstrate  the  power  of  the  present  approach, 
and  also  help  to  motivate  the  subsequent  sections.  Sec¬ 
tion  3  reviews  parallel  computing  and  local  interaction 
problems  in  general.  Sections  4  and  5  describe  the  im¬ 
plementation  of  the  parallel  simulation  system,  including 
the  automatic  migration  of  processes  from  busy  hosts  to 
free  hosts.  Section  6  explains  the  parallelization  of  nu¬ 
merical  methods  for  fluid  dynamics.  Finally,  sections  7 
and  8  present  experimental  measurements  of  the  perfor¬ 
mance  of  the  parallel  system,  and  develop  a  theoretical 
model  of  parallel  efficiency  for  local  interaction  problems 
which  fits  well  the  measured  efficiency.  Most  ideas  are 
discussed  as  generally  as  possible  within  the  context  of 
local  interaction  problems,  and  the  specifics  of  fluid  dy¬ 
namics  are  limited  to  section  2  and  section  6. 

2  Examples  of  flow  simulations 

The  parallel  system  has  been  successfully  applied  to  sim¬ 
ulate  the  flow  of  air  inside  flue  pipes  of  wind  musical  in¬ 
struments  such  cLS  the  organ,  the  recorder,  and  the  flute. 
This  is  a  phenomenon  that  involves  the  interaction  be¬ 
tween  hydrodynamic  flow  and  acoustic  waves:  When  a 
jet  of  air  impinges  a  sharp  obstacle  in  the  vicinity  of  a 
resonant  cavity,  the  jet  begins  to  oscillate  strongly,  and 
it  produces  audible  musical  tones.  The  jet  oscillations 
are  reenforced  by  a  nonlinear  feedback  from  the  acoustic 
waves  to  the  jet.  Similar  phenomena  occur  in  human 
whistling  and  in  voicing  of  fricative  consonants  (Sha- 
dle  [10]).  Although  sound-producing  jets  have  been  stud¬ 
ied  for  more  than  a  hundred  years,  they  remain  the  sub¬ 
ject  of  active  research  (Verge94  [11,  12],  Hirschberg  [13]) 
because  they  are  very  complex. 

Using  our  distributed  system  we  can  simulate  jets 
of  air  inside  flue  pipes  using  uniform  orthogonal  grids 


Figure  2:  Simulation  of  a  flue  pipe  using  15  workstations 
in  a  (6  X  4)  decomposition  with  9  subregions  inactive. 

as  large  as  1200  x  1200  in  two  dimensions  (1.5  million 
nodes).  We  typically  employ  smaller  grids,  however, 
such  as  800  x  500  (0.38  million  nodes)  in  order  to  re¬ 
duce  the  computing  time.  For  example,  if  we  divide  a 
800  X  500  grid  into  twenty  subregions  and  assign  each 
subregion  to  a  different  HP9000/700  workstation,  we  can 
compute  70,000  integration  steps  in  12  hours  of  run  time. 
This  produces  about  12  milliseconds  of  simulated  time, 
which  is  long  enough  to  observe  the  initial  response  of  a 
flue  pipe  with  a  jet  of  air  that  oscillates  at  1000  cycles 
per  second. 

Figure  1  shows  a  snapshot  of  a  800  x  500  simulation 
of  a  flue  pipe  by  plotting  equi-vorticity  contours  (the 
curl  of  fluid  velocity).  The  decomposition  of  the  two- 
dimensional  space  (5x4)  =  20  is  shown  eis  dashed  lines 
superimposed  on  top  of  the  physical  region.  The  gray 
arecis  are  walls,  and  the  dark-gray  area-s  are  walls  that 
enclose  the  simulated  region  and  demarcate  the  inlet  and 
the  outlet.  The  jet  of  air  enters  from  an  opening  on  the 
left  wall,  impinges  the  sharp  edge  in  front  of  it,  and  it 
eventually  exits  from  the  simulation  through  the  opening 
on  the  right  part  of  the  picture.  The  resonant  pipe  is 
located  at  the  bottom  part  of  the  picture. 

Figure  2  shows  a  snapshot  of  another  simulation  that  , 
uses  a  slightly  different  geometry  than  figure  1.  In  par¬ 
ticular,  figure  2  includes  a  long  channel  through  which 
the  jet  of  air  must  pass  before  impinging  the  sharp  edge. 
Also,  the  outlet  of  the  simulation  is  located  at  the  top  of 
the  picture  as  opposed  to  the  right.  This  is  convenient 
because  the  air  tends  to  move  upwards  after  impinging 
the  sharp  edge.  Overall,  figure  2  is  a  more  realistic  model  , 


of  flue  pipes  than  figure  1. 

From  a  computational  point  of  view  the  geometry  of 
figure  2  is  interesting  because  there  are  subregions  that 
are  entirely  gray,  i.e.  they  are  entirely  solid  walls.  Con¬ 
sequently,  we  do  not  need  to  assign  these  subregions  to 
any  workstation.  Thus,  although  the  decomposition  is 
(6  x  4)  =  24  ,  we  only  employ  15  workstations  for  this 
problem.  In  terms  of  the  number  of  grid  nodes,  the  full 
rectangular  grid  is  1107  x  700  or  0.7  million  nodes,  but 
we  only  simulate  15/24  of  the  total  nodes  or  0.48  million 
nodes.  This  example  shows  that  an  appropriate  decom¬ 
position  of  the  problem  can  reduce  the  computational 
effort  in  some  cases,  as  well  as  provide  opportunities  for 
parallelism.  More  sophisticated  decompositions  can  be 
even  more  economical  than  ours;  however,  we  prefer  to 
use  uniform  decompositions  and  identical-shaped  sub- 
regions  in  our  current  implementation  for  the  sake  of 
simplicity. 

We  have  performed  all  of  the  above  simulations  us¬ 
ing  the  lattice  Boltzmann  numerical  method.  We  will 
describe  further  this  method  and  other  issues  of  fluid 
dynamics  in  section  6.  Next,  we  review  the  basics  of 
local  interaction  problems,  and  we  describe  the  imple¬ 
mentation  of  our  distributed  system.  These  issues  are 
important  for  understanding  in  detail  how  our  system 
works  and  why  it  works  well. 

3  Local  interaction  computations 

We  define  a  local  interaction  computation  as  a  set  of 
“parallel  nodes”  that  can  be  positioned  in  space  so  that 
the  nodes  interact  only  with  neighboring  nodes.  For  ex¬ 
ample,  figure  3  shows  a  two-dimensional  space  of  parallel 
nodes  connected  with  solid  lines  which  represent  the  lo¬ 
cal  interactions.  In  this  example,  the  interactions  extend 
to  a  distance  of  one  neighbor,  and  have  the  shape  of  a 
star  stencil,  but  other  patterns  of  local  interactions  are 
also  possible.  Figure  4  shows  two  typical  interactions 
which  extend  to  a  distance  of  one  neighbor,  a  star  sten¬ 
cil  and  a  full  stencil. 

The  parallel  nodes  of  a  local  interaction  problem  are 
the  finest  grain  of  parallelism  that  is  available  in  the 
problem;  namely,  they  are  the  finest  decomposition  of 
the  problem  into  units  that  can  evolve  in  parallel  af¬ 
ter  communication  of  information  with  their  neighbors. 
In  practice,  the  parallel  nodes  are  usually  grouped  into 
subregions  of  nodes,  as  shown  in  figure  3  by  the  dashed 
lines.  Each  subregion  is  assigned  to  a  different  proces- 
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Figure  3:  A  problem  of  local  interactions  in  two  dimen¬ 
sions,  and  its  decomposition  (2x2)  into  four  subregions. 

sor,  and  the  problem  is  solved  in  parallel  by  executing 
the  following  sequence  of  steps  repeatedly, 

•  Calculate  the  new  state  of  the  interior  of  the  sub- 
region  using  the  previous  history  of  the  interior  as 
well  as  the  current  boundary  information  from  the 
neighboring  subregions. 

•  Communicate  boundary  information  with  the 
neighboring  subregions  in  order  to  prepare  for  the 
next  local  calculation. 

The  boundary  which  is  communicated  between  neigh¬ 
boring  subregions  is  the  outer  surface  of  the  subregions. 
Section  4.2  describes  a  goo.d  way  of  organizing  this  com¬ 
munication. 

Local  interaction  problems  are  ideal  for  parallel  com¬ 
puting  because  the  communication  is  local,  and  also  be¬ 
cause  the  amount  of  communication  relative  to  computa¬ 
tion  can  be  controlled  by  varying  the  decomposition.  In 
particular,  when  each  subregion  is  as  small  as  one  node 
(one  processor  per  node),  there  is  maximum  parallelism, 
and  a  lot  of  communication  relative  to  the  computation 
of  each  processor.  As  the  size  of  each  subregion  increases 
(which  is  called  “coarse-graining’" ),  both  the  parallelism 
and  the  the  amount  of  communication  relative  to  com¬ 
putation  decrease.  This  is  because  only  the  surface  of 
a  subregion  communicates  with  other  subregions.  Even¬ 
tually,  when  one  subregion  includes  all  the  nodes  in  the 
problem,  there  is  no  parallelism  and  no  need  for  commu¬ 


Figure  4:  A  star  stencil  and  a  full  stencil  represent  two 
typical  nearest  neighbor  local  interactions. 

nication  anymore.  Somewhere  between  these  extremes, 
we  often  find  a  good  match  between  the  size  of  the  subre¬ 
gion  (the  “parallel  grain  size”)  and  the  communication 
capabilities  of  the  computing  system.  This  is  the  rea¬ 
son  why  local  interaction  problems  are  very  flexible  and 
highly  desirable  for  parallel  computing. 

4  The  distributed  system 

The  design  of  our  system  follows  the  basic  ideas  of  local 
interaction  parallel  computing  that  we  discussed  above. 
In  this  section,  we  describe  an  implementation  which  is 
based  on  UNIX  and  TCP/IP  communication  routines. 
Our  implementation  also  exploits  the  common  file  sys¬ 
tem  of  the  workstations. 

4.1  The  main  modules 

For  the  sake  of  programming  modularity,  we  organize 
our  system  into  the  following  four  modules: 

♦  The  initialization  program  produces  the  initial  state 
of  the  problem  to  be  solved  as  if  there  was  only  one 
workstation. 

♦  The  decomposition  program  decomposes  the  initial 
state  into  subregions,  generates  local  states  for  each 
subregion,  and  saves  them  in  separate  files,  called 
“dump  files” .  These  files  contain  all  the  information 
that  is  needed  by  a  workstation  to  participate  in  a 
distributed  computation. 

♦  The  job-submit  program  finds  free  workstations  in 
the  cluster,  and  begins  a  parallel  subprocess  on  each 
workstation.  It  provides  each  process  with  a  dump 
file  that  specifies  one  subregion  of  the  problem.  The 
processes  execute  the  same  program  on  different 
data. 


•  The  monitoring  program  checks  every  few  minutes 
whether  the  parallel  processes  are  progressing  cor¬ 
rectly.  If  an  unrecoverable  error  occurs,  the  dis¬ 
tributed  simulation  is  stopped,  and  a  new  simula¬ 
tion  is  started  from  the  last  state  which  is  saved 
automatically  every  10  —  20  minutes.  If  a  worksta¬ 
tion  becomes  too  busy,  automatic  migration  of  the 
affected  process  takes  place,  as  we  explain  in  sec¬ 
tion  5. 

All  of  the  above  programs  (initialization,  decomposition, 
submit,  and  monitoring)  are  performed  by  one  desig¬ 
nated  workstation  in  the  cluster.  Although  it  is  possible 
to  perform  the  initialization  and  the  decomposition  in  a 
distributed  fashion  in  principle,  we  have  chosen  a  serial 
approach  for  simplicity. 

Regarding  the  selection  of  free  workstations,  our  strat¬ 
egy  is  to  separate  all  the  workstations  into  two  groups: 
workstations  with  active  users,  and  workstations  with 
idle  users  (meaning  more  than  20  minutes  idle  time). 
An  idle-user  does  not  necessarily  imply  an  idle  worksta¬ 
tion  because  background  jobs  may  be  running;  however, 
an  idle-user  is  preferred  to  an  active  user.  Thus,  we  first 
examine  the  idle-user  workstations  to  see  if  the  fifteen- 
minute  average  of  the  CPU  load  is  below  a  pre-set  value, 
in  which  case  the  workstation  is  selected.  For  example, 
the  load  must  be  less  than  0.6  where  1.0  means  that  a 
full-time  process  is  running  on  the  workstation.  After 
examining  the  idle-user  workstations,  we  examine  the 
active- user  workstations,  and  we  continue  the  search  as 
long  as  we  need  more  workstations. 

In  addition  to  the  above  programs  (initialization,  de¬ 
composition,  submit,  and  monitoring),  there  is  also  the 
parallel  program  which  is  executed  by  all  the  worksta¬ 
tions.  The  parallel  program  consists  of  two  steps:  “com¬ 
pute  locally” ,  and  “communicate  with  neighbors” .  Be¬ 
low  we  discuss  issues  relating  to  communication. 

4.2  Communication 

The  communication  between  parallel  processes  synchro¬ 
nizes  the  processes  in  an  indirect  fashion  because  it  en¬ 
courages  the  processes  to  begin  each  computational  cy¬ 
cle  together  with  their  neighbors  as  soon  as  they  re¬ 
ceive  data  from  their  neighbors.  Thus,  there  is  a  lo¬ 
cal  near-synchronization  which  also  encourages  a  global 
near-synchronization.  However,  neither  local  nor  global 
synchronization  is  guaranteed,  and  in  special  circum¬ 
stances  the  parallel  processes  can  be  several  integration 


time  steps  apart.  This  is  important  when  a  process  mi¬ 
grates  from  a  busy  host  to  a  free  host,  as  we  explain  in 
section  5  (also  see  the  appendix). 

We  organize  the  communication  of  data  between  pro¬ 
cesses  by  using  a  well-known  programming  technique 
which  is  called  “padding”  or  “ghost  cells”  (Fox  [3], 
Camp  [14]).  Specifically,  we  pad  each  subregion  with 
one  or  more  layers  of  extra  nodes  on  the  outside.  We 
use  one  layer  of  nodes  if  the  local  interaction  extends  to 
a  distance  of  one  neighbor,  and  we  use  more  layers  if  the 
local  interaction  extends  further.  Once  we  copy  the  data 
from  one  subregion  onto  the  padded  area  of  a  neighbor¬ 
ing  subregion,  the  boundary  values  are  available  locally 
during  the  current  cycle  of  the  computation.  This  is  a 
good  way  to  organize  the  communication  of  boundary 
values  between  neighboring  subregions. 

In  addition,  padding  leads  to  programming  modular¬ 
ity  in  the  sense  that  the  computation  does  not  need  to 
know  anything  about  the  communication  of  the  bound¬ 
ary.  As  long  as  we  compute  within  the  interior  of  each 
subregion,  the  computation  can  proceed  as  if  there  was 
no  communication  at  all.  Because  of  this  separation  be¬ 
tween  computation  and  communication,  we  can  develop 
a  parallel  program  as  a  straightforward  extension  of  a 
serial  program.  In  our  case,  we  have  developed  a  fluid 
dynamics  code  which  can  produce  either  a  parallel  pro¬ 
gram  or  a  serial  program  depending  on  the  settings  of  a 
few  C-compiler  directives.  The  main  differences  between 
the  parallel  and  the  serial  programs  are  the  padded  areas, 
and  a  subroutine  that  communicates  the  padded  areas 
between  processes. 

We  have  implemented  a  subroutine  that  communicates 
the  padded  areas  between  processes  using  “sockets”  and 
the  TCP/IP  protocol.  A  socket  is  an  abstraction  in  the 
UNIX  operating  system  that  provides  system  calls  to 
send  and  receive  data  between  UNIX  processes  on  differ¬ 
ent  workstations.  A  number  of  different  protocols  (types 
of  behavior)  are  available  with  sockets,  and  TCP/IP  is 
the  simplest  one.  This  is  because  the  TCP/IP  protocol 
guarantees  delivery  of  any  messages  sent  between  two 
processes.  Accordingly,  the  TCP/IP  protocol  behaves 
as  if  there  are  two  first-in-first-out  channels  for  writ¬ 
ing  data  in  each  direction  between  two  processes.  Also, 
once  a  TCP/IP  channel  is  opened  at  startup,  it  remains 
open  throughout  the  computation  except  during  migra¬ 
tion  when  it  must  be  re-opened,  as  we  shall  see  later. 

Opening  the  TCP/IP  channel  involves  a  simple  hand- 


shaking,  '‘I  am  listening  at  this  port  number.  I  want  to 
talk  to  you  at  this  port  number?  Okay,  the  channel  is 
open/’  The  port  numbers  are  needed  to  identify  uniquely 
the  sender  and  the  recipient  of  a  message  so  that  mes¬ 
sages  do  not  get  mixed  up  between  different  UNIX  pro¬ 
cesses.  Further,  the  port  numbers  must  be  known  in 
advance  before  the  TCP/IP  channel  is  opened.  Thus, 
each  process  must  first  allocate  its  port  numbers  for  lis¬ 
tening  to  its  neighbors,  and  then  write  the  port  numbers 
into  a  shared  file.  The  neighbors  must  read  the  shared 
file  before  they  can  connect  using  TCP/IP. 

5  Transparency  to  other  users 

Having  described  the  basic  operation  of  our  distributed 
system,  we  now  discuss  the  issues  that  arise  when  sharing 
the  workstations  with  other  users.  Specifically,  there  are 
two  issues  to  consider:  sharing  the  CPU  cycles  of  each 
workstation,  and  sharing  the  local  area  network  and  the 
file  server.  First,  we  describe  the  sharing  of  CPU  cy¬ 
cles  and  the  automatic  migration  of  processes  from  busy 
hosts  to  free  hosts. 

5.1  Automatic  migration  of  processes 

We  distinguish  the  utilization  of  a  workstation  into  three 
basic  categories: 

•  (i)  The  workstation  is  idle. 

•  (ii)  The  workstation  is  running  an  interactive  pro¬ 
gram  that  requires  fast  CPU  response  and  few  CPU 
cycles. 

•  (iii)  The  workstation  is  running  another  full-time 
process  in  addition  to  a  parallel  subprocess. 

In  the  first  two  cases,  it  is  appropriate  to  time-share  the 
workstation  with  another  user.  Furthermore,  it  is  pos¬ 
sible  to  make  the  distributed  computation  transparent 
to  the  regular  user  of  the  workstation  by  assigning  a 
low  runtime  priority  to  the  parallel  subprocesses  (UNIX 
command  “nice”).  Because  the  regular  user’s  tasks  run 
at  normal  priority,  they  receive  the  full  attention  of  the 
processor  immediately,  and  there  is  no  loss  of  interac¬ 
tiveness.  After  the  user’s  tasks  are  serviced,  there  are 
enough  CPU  cycles  left  for  the  distributed  computation. 

In  the  third  case,  when  a  workstation  is  running  an¬ 
other  full-time  process  in  addition  to  a  parallel  subpro¬ 
cess,  the  parallel  subprocess  must  migrate  to  a  new  host 
that  is  free.  This  is  because  the  parallel  process  interferes 
with  the  regular  user,  and  further,  the  whole  distributed 


computation  slows  down  because  of  the  busy  worksta¬ 
tion.  Clearly,  such  a  situation  must  be  avoided. 

Our  distributed  system  detects  the  need  for  migration 
using  the  monitoring  program  that  we  mentioned  in  the 
previous  section.  The  monitoring  program  checks  the 
CPU  load  of  every  workstation  via  the  UNIX  command 
“uptime” ,  and  signals  a  request  for  migration  if  the  five- 
minute-average  load  exceeds  a  pre-set  value,  typically 
1.5.  The  intent  is  to  migrate  only  if  a  second  full-time 
process  is  running  on  the  same  host,  and  to  avoid  mi¬ 
grating  too  often.  In  our  system  there  is  typically  one 
migration  every  45  minutes  for  a  distributed  computa¬ 
tion  that  uses  20  workstations  from  a  pool  of  25  work¬ 
stations.  Also,  each  migration  lasts  about  30  seconds. 
Thus,  the  cost  of  migration  is  insignificant  because  the 
migrations  do  not  happen  too  often. 

During  a  migration,  a  precise  sequence  of  events  takes 
place  in  order  for  the  migration  to  complete  successfully, 

•  The  affected  process  A  receives  a  signal  to  migrate. 

•  All  the  processes  get  synchronized. 

•  Process  A  saves  its  state  into  a  dump  file,  and  stops 
running. 

•  Process  A  is  restarted  on  a  free  host,  and  the  dis¬ 
tributed  computation  continues. 

Signals  for  migration  are  sent  through  an  interrupt  mech¬ 
anism,  “kill  -USR2”  (see  UNIX  manual).  In  this  way, 
both  the  regular  user  of  a  workstation  and  our  monitor¬ 
ing  program  can  request  a  parallel  subprocess  to  migrate 
at  any  time. 

The  reason  for  synchronizing  all  the  processes  prior  to 
migration,  is  to  simplify  the  restarting  of  the  processes 
after  the  migration  has  completed.  In  addition,  the  syn¬ 
chronization  allows  more  than  one  process  to  migrate  at 
the  same  time  if  it  is  desired.  In  our  system,  we  use  a 
synchronization  scheme  which  instructs  all  the  processes 
to  continue  running  until  a  chosen  synchronization  time 
step,  and  then  to  pause  for  the  migration  to  take  place. 
The  details  of  the  synchronization  scheme  are  described 
in  the  appendix. 

When  all  the  processes  reach  the  synchronization  time 
step,  the  processes  that  need  to  migrate  save  their  state 
and  exit,  while  they  notify  the  monitoring  program  to 
select  free  workstations  for  them.  The  other  parallel  pro¬ 
cesses  suspend  execution  and  close  their  TCP/IP  com¬ 
munication  channels.  When  the  monitoring  program 
finds  free  hosts  for  all  the  migrating  processes,  it  sends 


a  CONT  signal  to  the  waiting  processes.  In  response, 
all  the  processes  re-open  their  communication  channels, 
and  the  distributed  computation  continues  normally. 

Overall,  the  migration  mechanism  is  designed  to  be  as 
simple  as  possible.  In  fact,  it  is  equivalent  to  stopping 
the  computation,  saving  the  entire  state  on  disk,  and 
then  restarting;  except,  we  only  save  the  state  of  the 
migrating  process  on  disk.  In  contrast  to  this  simple 
migration  mechanism,  we  note  that  process  migration  in 
a  general  computing  environment  such  as  a  distributed 
operating  system  [15]  can  be  a  challenging  task.  In  our 
case  the  task  has  been  simplified  because  we  can  design 
our  processes  appropriately  to  accommodate  migration 
easily. 

5.2  Sharing  the  network  and  file  server 

A  related  issue  to  sharing  the  workstations  with  other 
users,  is  the  sharing  of  the  network  and  the  file  server. 
A  distributed  program  must  be  carefully  designed  to 
make  sure  that  the  system  does  not  monopolize  the  net¬ 
work  and  the  file  server.  Abuse  of  shared  resources  is 
very  common  in  today’s  UNIX  operating  system  because 
there  are  no  direct  mechanisms  for  controlling  or  limit¬ 
ing  the  use  of  shared  resources.  Thus,  a  program  such 
as  FTP  (file  transfer)  is  free  to  send  many  megabytes 
of  data  through  the  network,  and  to  monopolize  the 
network,  so  that  the  network  appears  “frozen”  to  other 
users.  A  distributed  program  can  monopolize  the  net¬ 
work  in  a  similar  way,  if  it  is  not  designed  carefully. 

Our  distributed  system  does  not  monopolize  the  net¬ 
work  because  it  includes  a  time  delay  between  successive 
send-operations,  during  which  the  parallel  processes  are 
calculating  locally.  Moreover,  the  time  delay  increases 
with  the  network  traffic  because  the  parallel  processes 
must  wait  to  receive  data  before  they  can  start  the  next 
integration  step.  Thus,  there  is  an  automatic  feedback 
mechanism  that  slows  down  the  distributed  computa¬ 
tion,  and  allows  other  users  to  access  the  network  at  the 
same  time. 

Another  situation  to  consider  is  when  the  parallel 
processes  are  writing  data  to  the  common  file  system. 
Specifically,  when  all  the  parallel  processes  save  their 
state  on  disk  at  approximately  the  same  time  (a  couple 
of  megabytes  per  process),  it  is  very  easy  to  saturate 
both  the  network  and  the  file  server.  In  order  to  avoid 
this  situation,  we  impose  the  constraint  that  the  parallel 
processes  must  save  their  state  one  after  the  other  in  an 


orderly  fashion,  allowing  sufficient  time  gaps  between, 
so  that  other  programs  can  use  the  network  and  the  file 
system.  Thus,  a  saving  operation  that  would  take  30 
seconds  and  monopolize  the  shared  resources,  now  takes 
60  —  90  seconds  but  leaves  free  time  slots  for  other  pro¬ 
grams  to  access  the  shared  resources  at  the  same  time. 
Overall,  a  careful  design  has  made  our  distributed  sys¬ 
tem  mostly  transparent  to  the  regular  users  of  the  work¬ 
stations. 


6  Fluid  dynamics 


Having  described  the  overall  design  of  our  distributed 
system,  we  now  turn  our  attention  to  the  specifics  of  fluid 
dynamics.  First,  we  review  the  equations  of  fluid  dynam¬ 
ics,  and  then  we  explain  why  local  interaction  methods 
are  appropriate  for  simulating  subsonic  flow.  Finally,  we 
outline  the  numerical  methods  that  we  use  in  our  system. 

The  evolution  of  a  flow  is  described  using  a  set  of  par¬ 
tial  differential  equations,  known  as  the  Navier  Stokes 
equations  (Tritton  [16],  Batchelor  [17],  Lamb  [18]). 
These  equations  can  take  different  forms  depending  on 
the  specific  problem  at  hand.  In  our  case,  the  Navier 
Stokes  equations  involve  three  fluid  variables  /),  14,1^: 
the  fluid  density,  and  the  components  of  the  fluid  ve¬ 
locity  in  the  x,y  directions  respectively.  The  variables 
P.Vx^Vy  are  functions  of  space  and  time,  and  the  Navier 
Stokes  equations  express  the  rates  of  change  of  these 
variables,  as  follows, 
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In  the  above  equations,  the  symbol  is  the  Laplacian 
operator  jdx^  and  the  coefficients  v  and  c, 

are  constants,  v  is  the  kinematic  viscosity  of  the  fluid 
(a  kind  of  friction),  and  c,  is  the  speed  of  sound.  In  the 
case  of  three-dimensional  flow  problems,  there  is  another 
equation  for  the  14  the  component  of  fluid  velocity  in  the 
z-direction.  Details  can  be  found  in  any  textbook  of  fluid 
mechanics. 

A  flow  is  simulated  by  solving  the  Navier  Stokes  equa¬ 
tions  numerically.  In  particular,  a  grid  of  fluid  nodes  is 
introduced,  which  looks  very  much  like  the  grid  of  nodes 
in  figure  3.  The  fluid  nodes  are  discrete  locations  where 
the  fluid  variables  density  and  velocity  are  calculated 


at  discrete  times.  A  numerical  method  is  used  to  cal¬ 
culate  the  future  values  of  density  and  velocity  at  each 
fluid  node  using  the  present  and  the  past  values  of  den¬ 
sity  and  velocity  at  this  node,  at  neighboring  nodes,  and 
possibly  at  distant  nodes  as  well. 

A  numerical  method  that  employs  only  neighboring 
nodes  to  calculate  the  future  solution,  is  called  an  ex¬ 
plicit  method  (or  local  interaction  method),  and  is  ideal 
for  parallel  computing.  Such  a  method  is  also  referred  to 
as  a  “time-marching”  method  because  the  present  values 
of  each  fluid  node  and  its  neighbors  produce  the  future 
value  of  this  fluid  node  at  time  <  +  A^,  and  so  on  repeat¬ 
edly,  where  At  is  the  integration  time  step.  By  contrast, 
a  numerical  method  that  employs  distant  nodes  to  cal¬ 
culate  the  future  solution,  is  called  an  implicit  method, 
and  is  difficult  to  parallelize.  This  is  because  an  implicit 
method  computes  the  solution  using  a  large  matrix  equa¬ 
tion  that  couples  together  distant  fluid  nodes,  and  leads 
to  complex  communication  between  distant  nodes. 

There  are  advantages  to  both  explicit  and  implicit 
methods.  The  obvious  advantage  of  explicit  methods  is 
the  ease  of  parallelization.  Another  issue  to  consider  is 
that  an  explicit  integration  step  is  much  less  costly  than 
an  implicit  integration  step.  A  disadvantage  of  explicit 
methods  is  that  they  become  numerically  unstable  at 
large  time  steps  At.  By  contrast,  implicit  methods  can 
often  use  much  larger  integration  time  steps  At  than 
explicit  methods  (Peyret&Taylor  [19]).  Thus,  implicit 
methods  can  often  compute  a  solution  using  fewer  time 
steps  than  explicit  methods.  In  a  practical  situation,  one 
has  to  consider  all  of  the  above  issues  to  decide  whether 
implicit  or  explicit  methods  are  more  suitable.  Namely, 
one  has  to  consider  the  relative  cost  of  an  implicit  step 
versus  an  explicit  step,  the  availability  of  parallel  com¬ 
puting,  and  the  nature  of  the  problem  which  affects  the 
choice  of  a  small  or  a  large  integration  time  step. 

In  the  case  of  simulating  subsonic  flow,  the  nature  of 
the  problem  does  not  allow  the  use  of  very  large  inte¬ 
gration  time  steps  At.  This  is  because  subsonic  flow 
includes  two  different  time-scales  -  slow-moving  hydro- 
dynamic  flow  and  fast-moving  acoustic  waves  -  and  the 
latter  dominate  the  choice  of  integration  time  step.  In 
particular,  the  time  step  At  must  be  very  small  to  model 
accurately  the  acoustic  waves  that  propagate  through 
the  fluid  and  reflect  off  obstacles.  If  Ax  is  the  spacing 
between  neighboring  fluid  nodes,  and  c,  is  the  speed  of 
propagation  of  acoustic  waves,  then  the  product  CgAt 


must  be  comparable  to  Ax  in  order  to  have  enough  res¬ 
olution  to  follow  the  passage  and  reflection  of  acoustic 
waves.  Thus,  we  require  the  relation. 

Ax  CsAt  (4) 

Because  of  this  requirement,  the  large  time  steps  of  im¬ 
plicit  methods  are  not  relevant.  Instead,  explicit  meth¬ 
ods  are  preferable  in  this  case  because  of  their  simplicity 
and  ease  of  parallelization. 

In  our  system,  we  employ  the  following  two  explicit 
methods:  explicit  finite  differences  (Peyret&Taylor  [19]), 
and  the  recently-developed  lattice  Boltzmann  method 
(Skordos  [20]).  The  finite  difference  method  is  a  straight¬ 
forward  discretization  of  the  Navier  Stokes  equations  I- 
3.  Specifically,  the  spatial  derivatives  are  discretized  us¬ 
ing  centered  differences  on  a  uniform  orthogonal  grid, 
and  the  time  derivatives  are  discretized  using  forward 
Euler  differences  (Peyret&Taylor  [19]).  For  the  purpose 
of  improving  numerical  stability,  the  density  equation  1 
is  updated  using  the  values  of  velocity  at  time  t  +  At. 
In  other  words,  the  velocities  values  are  computed  first, 
and  then  the  density  values  are  computed  as  a  separate 
step.  The  precise  sequence  of  computational  steps  for 
the  finite  difference  method  is  as  follows, 

•  Calculate  Vr^Vy  (inner) 

•  Communicate:  send/recv  14,  Vy  (boundary) 

•  Calculate  p  (inner) 

•  Communicate:  send/recv  p  (boundary) 

•  Filter  p,  14 ,  Vy  (inner) 

The  filter  that  is  included  above  is  crucial  for  simulat¬ 
ing  subsonic  flow  at  high  Reynolds  number  (fast  moving 
flow).  The  fast  flow  and  the  interaction  between  acoustic 
waves  and  hydrodynamic  flow  can  lead  to  slow-growing 
numerical  instabilities.  The  filter  prevents  the  instabili¬ 
ties  by  dissipating  high  spatial  frequencies  whose  wave¬ 
length  is  comparable  to  the  grid  mesh  size  (the  distance 
between  neighboring  fluid  nodes).  Our  filter  is  based  on 
a  fourth  order  numerical  viscosity  (Peyret&Taylor  [19]). 
We  use  the  same  filter  both  for  the  finite  difference 
method  and  for  the  lattice  Boltzmann  method. 

The  lattice  Boltzmann  method  is  a  recently-developed 
method  for  simulating  subsonic  flow,  which  is  compet¬ 
itive  with  finite  differences  in  terms  of  numerical  accu¬ 
racy.  Because  the  lattice  Boltzmann  method  is  a  relax¬ 
ation  type  of  algorithm,  it  is  somewhat  more  stable  than 
explicit  finite  differences.  The  lattice  Boltzmann  method 


uses  two  kinds  of  variables  to  represent  the  fluid,  the 
traditional  fluid  variables  p,Vx^Vy^  and  another  set  of 
variables  called  populations  F,-.  During  each  cycle  of  the 
computation,  the  fluid  variables  Vx,  Vy  are  computed 
from  the  F,- ,  and  then  the  /?,  14 ,  Vy  are  used  to  relax  the 
Fi.  Subsequently,  the  relaxed  populations  are  shifted  to 
the  nearest  neighbors  of  each  fluid  node,  and  the  cycle 
repeats.  The  precise  sequence  of  computational  steps  for 
the  lattice  Boltzmann  method  is  as  follows, 

•  Relax  Fi  (inner) 

•  Shift  Fi  (inner) 

•  Communicate:  send/recv  Fi  (boundary) 

•  Calculate  p,Vx,Vy  from  F,-  (inner) 

•  Filter  p^Vx^Vy  (inner) 

More  details  on  the  lattice  Boltzmann  method  can  be 
found  in  Skordos  [20]. 

Regarding  the  communication  of  boundary  values  by 
the  finite  difference  method  (FD)  and  the  lattice  Boltz¬ 
mann  method  (LB),  there  are  some  differences  that  will 
become  important  in  the  next  two  sections,  when  we  dis¬ 
cuss  the  performance  of  our  parallel  simulation  system. 
The  first  difference  is  that  FD  sends  two  messages  per 
computational  cycle  as  opposed  to  LB  which  sends  all 
the  boundary  data  in  one  message.  This  results  in  slower 
communication  for  FD  when  the  messages  are  small  be¬ 
cause  each  message  has  a  significant  overhead  in  a  local 
area  network.  The  second  difference  is  that  LB  commu¬ 
nicates  5  variables  (double  precision  floating-point  num¬ 
bers)  per  fluid  node  in  three  dimensional  problems,  while 
FD  communicates  only  4  variables  per  fluid  node.  In 
two  dimensional  problems,  both  methods  communicate 
3  variables  per  fluid  node. 

7  Experimental  measurements  of 
performance 

The  performance  of  the  parallel  simulation  system 
is  measured  when  using  the  finite  difference  method 
and  the  lattice  Boltzmann  method  to  simulate  a  well- 
known  problem  in  fluid  mechanics,  Hagen-Poiseuille  flow 
through  a  rectangular  channel  (Skordos  [20]  and  Lan- 
daui^Lifshitz  [21,  p.51]).  The  goal  of  testing  two  differ¬ 
ent  numerical  methods  is  to  examine  the  performance 
of  the  parallel  system  on  two  similar,  but  slightly  differ¬ 
ent  parallel  algorithms.  The  question  of  which  numerical 
method  is  better  for  a  particular  problem  is  not  our  main 
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Figure  5:  Parallel  efficiency  in  2D  simulations  using  lat¬ 
tice  Boltzmann. 

concern  here.  However,  we  can  say  that  the  two  meth¬ 
ods  produce  comparable  results  for  the  same  resolution 
in  space  and  time.  Moreover,  both  methods  converge 
quadratically  with  increased  resolution  in  space  to  the 
exact  solution  of  the  Hagen-Poiseuille  flow  problem. 

Below  we  present  measurements  of  the  parallel  effi¬ 
ciency  /,  and  the  speedup  S  defined  as  follows, 

/  =  -  =  (5) 

where  Tp  is  the  elapsed  time  for  integrating  a  problem, 
using  P  processors,  and  Ti  is  the  elapsed  time  for  inte¬ 
grating  the  same  problem  using  a  single  processor.  We 
measure  the  times  Tp  and  Ti  for  integrating  a  problem 
by  averaging  over  20  consecutive  integration  steps,  and 
also  by  averaging  over  each  processor  that  participates  in 
the  parallel  computation.  The  resulting  average  is  the 
time  interval  it  takes  to  perform  one  integration  step. 
We  use  the  UNIX  system  call  “gettimeofday”  to  obtain 
accurate  timings.  To  avoid  situations  where  the  Ether¬ 
net  network  is  overloaded  by  a  large  FTP  or  something 
else,  we  repeat  each  measurement  twice,  and  select  the 
best  performance. 

We  use  twenty-five  HP9000/700  workstations  that  are 
connected  together  by  a  shared-bus  Ethernet  network. 
Sixteen  of  the  workstations  are  715/50  models,  six  are 
720  models,  and  three  are  710  models.  The  715/50  work¬ 
stations  are  based  on  a  Risk  processor  running  at  50 
MHz,  and  have  an  estimated  performance  of  62  MIPS 


Figure  6:  Parallel  speedup  in  2D  simulations  using  lat¬ 
tice  Boltzmann. 

and  13  MFLOPS,  while  the  720  and  710  workstations 
have  a  slightly  lower  performance. 

For  analysis  purposes,  we  define  the  speed  of  a  work¬ 
station  as  the  number  of  fluid  nodes  integrated  per  sec¬ 
ond,  where  the  number  of  fluid  nodes  does  not  include 
the  padded  areas  discussed  in  section  4.2.  The  table 
below  presents  the  speed  of  the  workstations  for  2D  and 
3D  simulations  using  the  lattice  Boltzmann  method  (LB) 
and  the  finite  difference  method  (FD).  We  have  calcu¬ 
lated  these  numbers  by  averaging  over  simulations  of  dif¬ 
ferent  size  grids  that  range  from  100^  to  300^  fluid  nodes 
in  2D,  and  from  10^  to  44^  in  3D.  Also,  we  have  nor¬ 
malized  the  speeds  relative  to  the  speed  of  the  715/50 
workstation, 


715/50 

710 

720 

LB  2D 

1.0  ±.04 

.84  ±.02 

.86  ±.08 

LB  3D 

.51  ±.01 

.40  ±.01 

.42  ±  .02 

FD  2D 

1.24  ±.l 

1.08  ±.l 

1.17±.l 

FD  3D 

1.0±.l 

.85±.l 

.94  ±  .1 

The  relative  speed  of  1.0  corresponds  to  39132  fluid 
nodes  integrated  per  second. 

In  our  graphs  of  parallel  speedup  and  efficiency,  we  use 
the  the  715/50  workstation  to  represent  the  single  pro¬ 
cessor  performance.  We  do  not  use  the  performance  of 
the  slowest  workstation  (the  710  model)  for  normaliza¬ 
tion  purposes  because  it  would  over-estimate  the  perfor¬ 
mance  of  our  system.  In  particular,  most  of  the  worksta¬ 
tions  are  715  models,  and  our  strategy  is  to  choose  715 


Figure  7 :  Parallel  efficiency  in  2D  simulations  using  finite 
differences. 

models  first  before  choosing  the  slightly  slower  710  and 
720  models.  We  have  tested  that  the  speedup  achieved 
by  sixteen  workstations,  which  are  all  715  models,  does 
not  change  if  one  or  two  workstations  are  replaced  with 
710  models.  Thus,  it  makes  sense  to  normalize  our  re¬ 
sults  using  the  performance  of  the  715  model. 

Figure  5  shows  the  efficiency  cis  a  function  of  grain  size 
for  (2x2),  (3x3),  (4x4),  and  (5x4)  decompositions  (tri¬ 
angles,  crosses,  squares,  circles).  On  the  horizontal  axis, 
we  plot  the  square  root  of  number  of  nodes  N  of  each 
subregion.  We  see  that  good  performance  is  achieved 
in  two-dimensional  simulations  when  the  subregion  per 
processor  is  larger  than  100^  fluid  nodes.  In  the  next 
section,  we  present  a  theoretical  model  of  parallel  ef¬ 
ficiency  that  predicts  very  accurately  our  experimental 
results  shown  in  figure  5  and  in  the  other  figures  also. 
Figure  6  shows  the  speedup  for  the  lattice  Boltzmann 
method  (LB),  and  figures  7  and  8  show  the  efficiency 
and  speedup  for  the  finite  difference  method  (FD). 

We  notice  one  difference  between  the  FD  and  LB  ef¬ 
ficiency  curves:  the  efficiency  decreases  more  rapidly  for 
FD  than  LB  as  the  subregion  per  processor  decreases. 
To  understand  this  difference,  we  quote  a  general  for¬ 
mula  for  the  parallel  efficiency,  which  is  derived  in  the 
next  section  (see  equation  12), 

/  =  (l  +  p^)  '  (6) 

10 


where  Team  and  Tcau  are  the  communication  and  the 


Figure  8:  Parallel  speedup  in  2D  simulations  using  finite 
differences. 

computation  time  it  takes  to  perform  one  integration 
step.  We  observe  that  TcaU  Is  smaller  for  FD  than  LB 
(see  the  table  of  speeds  earlier),  and  moreover  that  Tcom 
becomes  larger  for  FD  than  LB  as  the  subregion  per  pro¬ 
cessor  decreases.  The  latter  is  true  because  each  message 
in  a  local  area  network  incurs  an  overhead,  and  FD  com¬ 
municates  two  messages  per  integration  step  as  opposed 
to  LB  which  communicates  only  one  message  per  inte¬ 
gration  step  (see  end  of  section  6).  Because  of  these 
differences  between  FD  and  LB,  the  efficiency  decreases 
more  rapidly  for  FD  than  LB  as  the  subregion  per  pro¬ 
cessor  decreases. 

Next,  we  compare  the  efficiency  of  three-dimensional 
simulations  versus  two-dimensional  ones,  using  the  lat¬ 
tice  Boltzmann  method.  Figure  9  plots  the  efficiency  of 
2D  and  3D  simulations  as  a  function  of  the  number  of 
processors  P.  Here,  we  simulate  a  problem  which  grows 
linearly  with  the  number  of  processors  P,  and  is  decom¬ 
posed  as  (P  X  1)  in  2D,  and  as  (P  x  1  x  1)  in  3D.  The 
subregion  per  processor  is  held  fixed  at  120^  nodes  in 
2D,  and  25^  nodes  in  3D,  which  are  comparable  sizes, 
equal  to  about  14,500  fluid  nodes  per  processor*  We 
see  that  the  efficiency  remains  high  in  2D  (triangles), 
and  decreases  quickly  in  3D  (crosses)  as  the  number  of 
processors  increases.  This  is  because  the  total  traffic 
through  the  shared-bus  network  increases  in  proportion 
to  the  number  of  processors,  and  this  affects  Tcom 
equation  6  as  we  shall  see  in  more  detail  in  the  next  sec- 


Figure  9:  The  Ethernet  network  performs  well  for  2D 
simulations  (triangles),  but  poorly  for  3D  simulations 
(crosses). 

tion.  Also,  we  note  that  3D  requires  much  more  data 
to  be  communicated  per  step  than  2D.  Thus,  Tcom  in¬ 
creases  faster  for  3D  than  2D,  and  the  efficiency  drops 
faster  in  the  case  of  3D  simulations. 

Another  way  of  examining  the  efficiency  of  3D  simula¬ 
tions  is  shown  in  figures  10  and  11.  Figure  10  plots  the 
efficiency  against  the  size  of  the  subregion  for  different 
decompositions  (2x2x2),  (3x2x2),  etc.  We  can 
see  that  the  efficiency  is  rather  poor.  Figure  11  plots 
the  speedup  against  the  total  size  of  the  problem.  We 
can  see  that  the  speedup  does  not  improve  when  finer 
decompositions  are  employed  because  the  network  is  the 
bottleneck  of  the  computation. 

The  results  shown  in  figures  10  and  11  have  been  ob¬ 
tained  using  the  lattice  Boltzmann  method.  The  par¬ 
allel  efficiency  of  the  finite  difference  method  (FD)  in 
3D  simulations  is  even  worse  than  the  lattice  Boltzmann 
method  (LB),  and  is  not  shown  here.  The  FD  efficiency 
is  worse  than  LB  because  the  FD  computes  twice  as  fast 
as  LB  per  integration  step  (see  earlier  table  of  speeds), 
which  makes  the  ratio  TcomjTcaXc  larger  for  FD  than  LB, 
and  leads  to  lower  efficiency  according  to  equation  6. 

We  note  that  in  our  system  the  low  efficiency  of  3D 
simulations  is  accompanied  by  frequent  network  errors 
because  of  excessive  network  traffic.  In  particular,  the 
TCP/IP  protocol  fails  to  deliver  messages  after  excessive 
retransmissions.  Both  the  low  efficiency,  and  the  network 
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Figure  10:  Parallel  efficiency  m  3D  simulations  using  the 
lattice  Boltzmann  method. 


Figure  11:  Parallel  speedup  in  3D  simulations  using  the 
lattice  Boltzmann  method. 


errors  indicate  the  need  for  a  faster  network,  or  dedicated 
connections  between  neighboring  processors  in  order  to 
perform  3D  simulations  efficiently. 

8  Theoretical  analysis  of  parallel 
efficiency 

In  order  to  understand  better  the  experimental  results 
of  the  previous  section,  we  develop  a  theoretical  model 
of  the  parallel  efficiency  of  local  interaction  problems. 
In  particular,  we  derive  a  formula  for  the  parallel  effi¬ 
ciency  in  terms  of  the  parallel  grain  size  (the  size  of  the 
subregion  that  is  assigned  to  each  processor),  the  speed 
of  the  processors,  and  the  speed  of  the  communication 
network.  Our  analysis  is  based  on  two  assumptions:  (i) 
the  computation  is  completely  parallelizable,  and  (ii)  the 
communication  does  not  overlap  in  time  with  the  com¬ 
putation.  The  first  assumption  is  valid  for  local  inter¬ 
action  problems,  and  the  second  assumption  is  valid  for 
the  distributed  system  that  we  have  implemented.  The 
extension  of  our  analysis  to  situations  where  communica¬ 
tion  and  computation  overlap  in  time  is  straightforward 
as  we  shall  see  afterwards. 

We  first  examine  the  relationship  between  the  effi¬ 
ciency  and  the  processor  utilization.  We  define  the  ef¬ 
ficiency  /  as  the  speedup  S  divided  by  the  number  of 
processors  P.  Further,  we  define  the  speedup  S  as  the 
ratio  Ti/Tp  of  the  total  time  it  takes  to  solve  a  problem 
using  one  processor,  denoted  Ti,  divided  by  the  total 


time  it  takes  to  solve  the  same  problem  using  P  proces¬ 
sors,  denoted  Tp.  In  other  words,  we  have  the  following 
expression, 


Ti 

pn 


(7) 


We  define  the  processor  utilization  g  as  the  fraction  of 
time  spent  for  computing,  denoted  Tcaic^  divided  by  the 
total  time  spent  for  solving  a  problem  which  includes 
both  computing  and  waiting  for  communication  to  com¬ 
plete.  Also,  we  use  the  simplifying  assumption  that  the 
communication  and  the  computation  do  not  overlap  in 
time,  so  that  we  define  Tcom  as  the  time  spent  for  com¬ 
munication  without  any  computation  occurring  during 
this  time.  Thus,  we  have  the  following  expression, 


Tcalc 


Tcalc  +  Tct 


=  1  + 


Tcalc  J 


(8) 


To  compare  /  and  g,  we  note  that  the  values  of  both  / 
and  g  range  between  the  following  limits, 


0  <  <  1 
0</<  1 


(9) 


for  the  worst  case  and  the  best  case  respectively.  We  ex¬ 
pect  that  high  utilization  g  corresponds  to  high  parallel 
efficiency  /;  however,  this  depends  on  the  problem  that 
we  are  trying  to  compute  in  parallel. 

In  the  special  case  of  a  problem  that  is  completely 
parallelizable,  the  processor  utilization  g  is  exactly  equal 
to  the  parallel  efficiency  /.  To  show  this,  we  use  the 
following  relation  as  the  definition  of  a  problem  being 


completely  parallelizable, 

TcaU  =  y  (10) 

Then,  we  also  use  the  assumption  that  communication 
and  computation  do  not  overlap  in  time,  so  that  we  can 
obtain  a  second  relation, 

{Tcalc  +  Tcom)  =  Tp  (11) 

By  substituting  equations  10  and  11  into  equation  7, 
and  comparing  with  equation  8,  we  arrive  at  the  desired 
result  that  the  parallel  efficiency  is  exactly  equal  to  the 
processor  utilization, 

We  have  derived  the  above  equation  under  the  assump¬ 
tion  that  communication  and  computation  do  not  over¬ 
lap  in  time.  If  this  assumption  is  violated  in  a  practical 
situation,  then  the  communication  time  Tcom  should  be 
replaced  with  a  smaller  time  interval,  the  effective  com¬ 
munication  time.  This  modification  does  not  change  the 
conclusion  /  =  ^,  it  simply  gives  higher  values  of  effi¬ 
ciency  and  utilization. 

To  proceed  further,  we  need  to  find  how  the  ratio 
Tcom/Tcaic  depends  on  the  size  of  the  subregion.  First, 
we  observe  that  Tcalc  is  proportional  to  the  size  of  the 
subregion.  If  N  is  the  size  of  the  subregion  (the  num¬ 
ber  of  parallel  nodes  that  constitute  one  subregion),  we 
write, 

where  UcaU  is  a  constant,  the  computational  speed  of  the 
processors  for  the  specific  problem  at  hand.  In  a  similar 
way,  we  seek  to  find  a  formula  for  the  communication 
time  Tcom  in  terms  of  the  size  of  the  subregion  that  is 
assigned  to  each  processor.  As  a  first  model,  we  write 
the  following  simple  expression, 

Tc.m  =  7^  (14) 

where  Nc  is  the  number  of  communicating  nodes  in  each 
subregion,  namely  the  outer  surface  of  each  subregion. 
The  factor  Ucom  represents  the  speed  of  the  communica¬ 
tion  network. 

For  analysis  purposes,  we  want  to  know  exactly  how 
Nc  varies  with  the  size  of  the  subregion  N.  We  consider 
the  geometry  of  a  subregion  in  two  dimensions.  We  can 
see  that  the  boundary  of  a  subregion  is  one  power  smaller 
than  the  volume  expressed  in  terms  of  the  number  of 


nodes.  For  example,  if  we  consider  square  subregions 
of  size  nodes,  the  enclosing  boundary  contains  4L 
nodes,  and  the  ratio  of  communicating  nodes  to  the  total 
number  of  nodes  per  subregion  can  be  as  large  as  4/L. 
In  general,  we  have  the  following  relations, 


(15) 

AT,  = 

(16) 

in  two  and  three  dimensions  respectively,  where  the  con¬ 
stant  m  depends  on  the  geometry  of  the  decomposition. 
For  example,  if  the  decomposition  of  a  problem  is  (Px  1), 
then  m  =  2  because  each  subregion  communicates  with 
its  left  and  right  neighbors  only.  The  following  table 
gives  m  for  a  few  decompositions  which  we  use  in  our 
performance  measurements  in  section  7, 


P  X  1 

2x2 

3x3 

4x4 

5x4 

m 

2 

2 

3 

4 

4 

If  we  introduce  the  above  formulas  for  Nc  and  m  into 
equation  12,  we  obtain  the  following  expressions  for  the 
parallel  efficiency  of  a  local  interaction  problem  in  two 
and  three  dimensions  respectively, 

/  =  fi  +  N-'/^  (17) 

f  =  +  (18) 

\  ^  com  / 

The  above  equations  show  that  if  N  is  sufficiently  large 
compared  to  the  term  mU com  locale,  then  we  can  achieve 
high  parallel  efficiency. 

A  few  comments  are  in  order.  First,  we  must  remem¬ 
ber  that  in  practice  we  can  not  increase  arbitrarily  the 
size  of  the  subregion  per  processor  in  order  to  achieve 
high  efficiency.  This  is  because  the  computation  may 
take  too  long  to  complete,  and  because  the  memory  of 
each  workstation  is  limited.  In  our  present  system,  each 
workstation  has  maximum  memory  32  megabytes,  and 
a  large  part  of  this  memory  is  taken  by  other  programs, 
and  other  users.  A  practical  upper  limit  of  how  much 
memory  we  can  use  per  workstation  is  15  megabytes, 
which  corresponds  to  300^  fluid  nodes  in  2D  simulations 
and  40^  fluid  nodes  in  3D  simulations. 

In  2D  simulations  the  upper  limit  of  300^  fluid  nodes 
per  subregion  is  large  enough  to  achieve  high  efficiency. 
As  we  saw  in  figure  5,  high  efficiency  is  achieved  when  the 
subregion  per  processor  is  larger  than  100^  fluid  nodes. 
By  contrast,  in  3D  simulations  the  upper  limit  of  40^ 
fluid  nodes  per  subregion  is  too  small  to  achieve  high 
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Figure  12:  Theoretical  model  of  parallel  efficiency  for 
two-dimensional  subregions  of  size  N . 


efficiency.  Further,  the  efficiency  depends  on  the  size  of 
the  subregion  as  in  3D  versus  in  2D.  as  we 

can  see  from  equations  17  and  18.  This  means  that  the 
size  of  the  subregion  N  must  increase  much  faster  in  3D 
than  in  2D  to  achieve  similar  improvements  in  efficiency. 
Because  of  this  fact,  achieving  high  efficiency  in  3D  sim¬ 
ulations  is  much  more  difficult  than  in  2D  simulations. 

Having  described  the  basic  idea  behind  our  model  of 
parallel  efficiency,  we  now  discuss  a  small  improvement 
of  our  model.  We  observe  that  in  the  case  of  a  shared- 
bus  network  the  communication  time  Tcom  must  depend 
on  the  number  of  processors  that  are  using  the  network. 
In  particular,  if  we  assume  that  all  the  processors  ac¬ 
cess  the  shared-bus  network  at  the  same  time,  then  the 
communication  time  Tcom  must  increase  linearly  with 
the  number  of  processors.  Baised  on  this  assumption,  we 
rewrite  equation  14  for  Tcom  ^  follows, 

1) 


Tcom  = 


(19) 


for  the  case  of  two  dimensional  problems.  The  constant 
Vcom  is  the  speed  of  communication  when  there  are  only 
two  processors  sharing  the  network.  Using  the  new  ex¬ 
pression  for  Tcomy  the  equation  of  parallel  efficiency  in 
two  dimensions  becomes  as  follows, 


/  =  (l  +  (20) 

\  ^com  J 

To  verify  our  model,  below  we  compare  the  efficiency 
that  is  predicted  by  our  model  against  the  experimen¬ 
tally  measured  efficiency  of  section  7. 


number  of  processors  P 

Figure  13:  Theoretical  model  of  parallel  efficiency  which 
assumes  that  the  communication  time  increases  linearly 
with  the  number  of  processors. 


Figure  12  plots  the  efficiency  /  versus  accord¬ 

ing  to  formula  20,  using  UcaicIVcom  =2/3.  The  four 
curves  marked  with  triangle,  cross,  square,  circle  corre¬ 
spond  to  different  numbers  of  processors  P  =  4,  9, 16,  20 
and  also  different  values  of  m  =  2,  3,4,4  which  depends 
on  the  geometry  of  the  decomposition  as  we  explained 
earlier.  A  comparison  between  the  predicted  efficiency 
shown  in  figure  12  and  the  experimentally  measured  ef¬ 
ficiency  shown  in  figure  5  reveals  good  agreement  when 
the  subregion  per  processor  is  larger  than  N  >  100^. 
However,  for  small  subregions,  N  <  100^,  the  predicted 
efficiency  is  too  high  compared  to  the  experimental  ef¬ 
ficiency.  The  reason  for  this  is  that  messages  in  a  lo¬ 
cal  area  network  have  a  large  overhead  which  becomes 
important  when  the  messages  are  small,  namely,  when 
the  subregion  per  processor  is  smaller  than  N  <  100^ 
fluid  nodes.  The  overhead  of  small  messages  leads  to  a 
smaller  communication  speed  Vcomy  s^nd  a  correspond¬ 
ing  decrease  of  efficiency  /.  We  have  not  attempted  to 
model  the  overhead  of  small  messages  here. 
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Another  way  of  examining  the  validity  of  equation 
equation  20  is  to  plot  the  efficiency  /  versus  the  num¬ 
ber  of  processors  P  while  keeping  all  other  parameters 
constant.  In  figure  13,  we  plot  the  efficiency  of  2D  sim¬ 
ulations  according  to  equation  20  using  N  =  125^.  We 
set  UcaicIVcom  =  2/3  as  we  did  in  figure  12,  and  we  set 
m  =  2  because  each  subregion  communicates  with  its  left 


and  right  neighbors  only.  For  comparison  purposes,  we 
also  plot  the  efficiency  of  3D  simulations,  using  N  =  25^ 
and  m  =  2,  The  computational  speed  is  half  as  large 
in  3D  than  in  2D,  and  the  communication  of  each  fluid 
node  in  3D  requires  5/3  as  much  data  as  in  2D.  Taking 
these  numbers  into  account,  we  can  write  the  following 
expression  for  the  parallel  efficiency  of  3D  simulations, 

/  =  +  (21) 

where  the  factor  5/6  arises  because  we  use  the  2D  values 
of  Ucaic  and  Vcom  which  give  Ucaic/Vcom  =  2/3. 

We  now  compare  the  predicted  efficiency  shown  in  fig¬ 
ure  13  against  the  experimentally  measured  efficiency 
shown  in  figure  9.  We  can  see  that  there  is  good  agree¬ 
ment.  Also,  the  overhead  of  small  messages,  which  we 
mentioned  earlier,  does  not  affect  the  predicted  efficiency 
in  this  case  because  the  subregion  per  processor  is  large, 
N  =  125^  in  2D,  and  25^  in  3D.  Overall,  we  find  reason¬ 
able  agreement  between  the  theoretical  model  and  the 
experimental  measurements  of  parallel  efficiency.  The 
model  can  be  improved  further,  if  desired,  by  employing 
more  sophisticated  expressions  for  the  communication 
time  Tcom  in  equation  19  which  describes  the  behavior 
of  the  shared-bus  Ethernet  network. 

9  Conclusion 

A  promising  approach  of  simulating  fluid  dynamics  on 
a  cluster  of  non-dedicated  workstations  has  been  pre¬ 
sented.  The  approach  is  particularly  good  for  simulating 
subsonic  flow  which  involves  both  hydrodynamics  and 
acoustic  waves.  A  parallel  simulation  system  has  been 
developed  and  applied  to  solve  a  real  problem,  the  sim¬ 
ulation  of  air  flow  inside  wind  musical  instruments. 

The  system  achieves  concurrency  by  decomposing  the 
flow  problem  into  subregions,  and  by  assigning  the  sub- 
regions  to  parallel  subprocesses  on  different  worksta¬ 
tions.  The  use  of  explicit  numerical  methods  leads 
to  minimum  communication  requirements.  The  paral¬ 
lel  processes  automatically  migrate  from  busy  hosts  to 
free  hosts  in  order  to  exploit  the  unused  cycles  of  non- 
dedicated  workstations,  and  to  avoid  disturbing  the  reg¬ 
ular  users.  Typical  simulations  achieve  80%  parallel  ef¬ 
ficiency  (speedup/processors)  using  20  HP- Apollo  work¬ 
stations. 

Detailed  measurements  of  the  parallel  efficiency  of  2D 
and  3D  simulations  have  been  presented,  and  a  the¬ 
oretical  model  of  efficiency  has  been  developed  which 


fits  closely  the  measurements.  The  measurements  show 
that  a  shared-bus  Ethernet  network  with  10Mbps  peak 
bandwidth  (megabits  per  second)  is  sufficient  for  two- 
dimensional  simulations  of  subsonic  flow,  but  is  limited 
for  three-dimensional  simulations.  It  is  expected  that 
the  use  of  new  technologies  in  the  near  future  such  as 
Ethernet  switches.  FDDI  and  ATM  networks  will  make 
practical  three-dimensional  simulations  of  subsonic  flow 
on  a  cluster  of  workstations. 
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Appendix 

The  appendix  describes  certain  aspects  of  our  dis¬ 
tributed  system  that  are  not  vital  for  a  general  reading, 
but  are  useful  to  someone  who  is  interested  in  imple¬ 
menting  a  distributed  system  similar  to  ours. 

A  Un-synchronization  of  processes 

The  synchronization  between  parallel  processes  that  we 
discussed  in  section  4.2  can  be  violated  in  situations  such 
as  the  following.  Let  us  suppose  that  process  A  stops 
execution  after  communicating  its  data  for  integration 
step  N.  The  nearest  neighbor  B  can  integrate  up  to 
step  -|-  1  and  then  stop.  Process  B  can  not  integrate 
any  further  without  receiving  data  for  integration  step 
A'  -h  1  from  process  A.  However,  the  next  to  nearest 
neighbor  can  integrate  up  to  step  A  +  2,  and  so  on.  If 
we  consider  a  two-dimensional  decomposition  {J  x  K) 
of  a  problem,  the  largest  difference  in  integration  step 
between  two  processes  is  A  A, 

AN  =  max{J,  K)  -  1  (22) 

assuming  that  neighbors  depend  on  each  other  along  the 
diagonal  direction  (this  corresponds  to  a  full  stencil  of 
local  interactions  as  shown  in  figure  4).  If  neighbors 
depend  on  each  other  along  the  horizontal  and  verti¬ 
cal  directions  only  (this  is  the  star  stencil  of  figure  4), 
then  the  largest  difference  in  integration  step  between 
two  processes  becomes, 

AN  =  ( J  -  1)  -h  (A  -  1) 


(23) 


These  worst  cases  of  un-synchronization  are  important 
when  we  consider  the  migration  of  processes  because  a 
precise  global  synchronization  is  required  then,  as  is  ex¬ 
plained  in  section  5. 

B  Synchronization  algorithm 

The  synchronization  algorithm  that  is  used  during  pro¬ 
cess  migration  (see  section  5)  is  as  follows.  First,  a  syn¬ 
chronization  request  is  sent  to  all  the  processes  by  means 
of  a  UNIX  interrupt.  In  response  to  the  request,  every 
process  writes  the  current  integration  time  step  into  a 
shared  file  (using  file  locking  semaphores,  and  append 
mode).  Then,  every  process  examines  the  shared  file  to 
find  the  largest  integration  time  step  Tmax  among  all  the 
processes.  Further,  every  process  chooses  (Tmax+  1)  to 
be  the  upcoming  synchronization  time  step,  and  contin¬ 
ues  running  until  it  reaches  this  time  step.  It  is  impor¬ 
tant  that  all  the  processes  can  reach  the  synchronization 
time  step,  and  that  no  process  continues  past  the  syn¬ 
chronization  time  step. 

The  above  algorithm  finds  the  smallest  synchroniza¬ 
tion  time  step  that  is  possible  at  any  given  time,  so  that 
a  pending  migration  can  take  place  as  soon  as  possible. 

C  Order  of  communication 

A  minor  efficiency  issue  with  regard  to  TCP/IP  commu¬ 
nication  (see  section  4.2)  is  the  order  in  which  the  neigh¬ 
boring  processes  communicate  with  each  other.  One 
way  is  for  each  parallel  process  to  communicate  with 
its  neighbors  on  a  first-come-first-served  basis.  An  alter¬ 
native  way  is  to  impose  a  strict  ordering  on  the  way  the 
processes  communicate  with  each  other.  For  example,  we 
consider  a  one-dimensional  decomposition  (J  x  1)  of  a 
problem  with  non-periodic  outer-boundaries  where  each 
process  receives  data  from  its  left  neighbor  before  it  can 
send  data  to  its  right  neighbor.  Then,  the  leftmost  pro¬ 
cess  No.  1  will  access  the  network  first,  and  the  nearest- 
neighbor  process  No.  2  will  access  the  network  second, 
and  so  on.  The  intent  of  such  ordering  is  to  pipeline  the 
messages  through  the  shared-bus  network  in  a  strict  fash¬ 
ion  in  an  attempt  to  improve  performance.  However,  it 
does  not  work  very  well  if  one  process  is  delayed  because 
all  the  other  processes  are  delayed  also.  Small  delays 
are  inevitable  in  time-sharing  UNIX  systems,  and  strict 
ordering  amplifies  them  to  global  delays.  By  contrast, 
asynchronous  first-come-first-served  communication  al¬ 
lows  the  computation  to  proceed  in  those  processes  that 


are  not  delayed,  and  better  performance  is  achieved  over¬ 
all.  In  our  system  we  implement  first-come-first-served 
communication  using  the  “select”  system  call  of  sockets 
(see  UNIX  manual). 

D  Other  communication  mechanisms 

In  section  4.2  we  described  the  communication  mecha¬ 
nism  of  our  system  which  is  based  on  the  TCP/IP  pro¬ 
tocol  and  sockets.  Apart  from  the  TCP/IP  protocol, 
another  protocol  that  is  popular  in  distributed  systems 
is  the  UDP/IP  protocol,  also  known  as  datagrams.  The 
UDP/IP  protocol  is  similar  to  TCP/IP  with  one  major 
difference:  There  is  no  guaranteed  delivery  of  messages. 
Thus,  the  distributed  program  must  check  that  messages 
are  delivered,  and  resend  messages  if  necessary,  which 
is  a  considerable  effort.  However,  the  benefit  is  that 
the  distributed  program  has  more  control  of  the  com¬ 
munication.  For  example,  a  distributed  program  could 
take  advantage  of  knowing  the  special  properties  of  its 
own  communication  to  achieve  better  results  than  the 
TCP/IP  standard.  Also,  another  advantage  is  robust¬ 
ness  in  the  case  of  network  errors  that  occur  under  very 
high  network  traffic.  For  example,  when  TCP/IP  fails, 
it  is  hard  to  know  which  messages  need  to  be  resent. 
In  UDP/IP  the  distributed  program  controls  precisely 
which  data  is  sent  and  when,  so  that  the  failure  problem 
is  handled  directly.  Despite  these  advantages  of  UDP/IP 
over  TCP/IP,  we  have  chosen  to  work  with  TCP/IP  be¬ 
cause  of  its  simplicity. 

E  Performance  bugs  to  avoid 

In  section  7  we  presented  measurements  of  the  perfor¬ 
mance  of  our  workstations.  Here,  we  note  that  the  per¬ 
formance  of  the  HP9000/700  Apollo  workstations  can 
degrade  dramatically  at  certain  grid  sizes  by  a  factor  of 
two  or  more,  but  there  is  an  easy  way  to  fix  the  problem. 
The  loss  of  performance  occurs  when  the  length  of  the 
arrays  in  the  program  is  a  near  multiple  of  4096  bytes 
which  is  also  the  virtual- memory  page  size.  This  suggests 
that  the  loss  of  performance  is  related  to  the  prefetching 
algorithm  of  the  CPU  cache  of  the  HP9000/700  comput¬ 
ers.  To  avoid  the  loss  of  performance,  we  lengthen  our 
arrays  with  200-300  bytes  when  their  length  is  a  near 
multiple  of  4096.  This  modification  eliminates  the  loss 
of  performance. 
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